Mining HTML Pages to Support Document Sharing in a Cooperative System
Abstract
This paper investigates the problem of classifying HTML documents in the context of a client-server application, named WebClass, developed to support the search activity of a geographically distributed group of people with common interests. The two main issues studied are the selection of features to represent HTML documents and the construction of the classifiers. A new feature selection technique is presented, and its interaction with different classifiers is studied experimentally. Results show that performance improves even with simple classifiers and that the proposed feature selection technique compares favorably with other well-known approaches.
Similar resources
Web Classification Approach Using Reduced Vector Representation Model Based on Html Tags
Automatic web page classification plays an essential role in information retrieval, web mining and web semantics applications. Web pages have special characteristics (such as HTML tags and hyperlinks) that make their classification different from standard text categorization. Thus, when applied to web data, traditional text classifiers do not usually produce promising results. In this pap...
Web Page Structure Enhanced Feature Selection for Classification of Web Pages
Web page classification is achieved using text classification techniques. It differs from traditional text classification because web page structure provides additional information about content importance. HTML tags determine the visual representation of a web page and can be treated as a parameter that highlights content importance. Textual keywords...
Web Page Classification: A Soft Computing Approach
The Internet makes it possible to share and manipulate a vast quantity of information efficiently and effectively, but the rapid and chaotic growth experienced by the Net has generated a poorly organized environment that hinders the sharing and mining of useful data. The need for meaningful web-page classification techniques is therefore becoming an urgent issue. This paper describes a novel ap...
Web pages ranking algorithm based on reinforcement learning and user feedback
The main challenge of a search engine is ranking web documents to provide the best response to a user's query. Although a huge number of results are extracted for a user's query, users examine only the first few; therefore, placing the relevant results in the top ranks is of great importance. In this paper, a ranking algorithm based on the reinforcement le...
A manually annotated HTML corpus for a novel scientific trend analysis
Here we present a manually annotated corpus of web pages and an annotation tool for Web Content Mining. The corpus is extensively annotated, has a hierarchical label structure and is freely available for research purposes. The annotation tool is a Firefox extension which allows the annotator to work with the pages in their original appearance. This tool handles the annotation hierarchy independent...